Efficient Distributed Decision Trees for Robust Regression [Technical Report]

Authors

  • Tian Guo
  • Konstantin Kutzkov
  • Mohamed Ahmed
  • Jean-Paul Calbimonte
  • Karl Aberer
Abstract

The availability of massive volumes of data and recent advances in data collection and processing platforms have motivated the development of distributed machine learning algorithms. In many real-world applications, large datasets are inevitably noisy and contain outliers. These outliers can dramatically degrade the performance of standard machine learning approaches such as regression trees. To address this, we present a novel distributed regression tree approach for large and noisy data that uses robust regression statistics, which are less sensitive to outliers. We propose to integrate error criteria based on robust statistics into the regression tree. We also develop a data summarization method and use it to improve the efficiency of learning regression trees in the distributed setting. We implemented the proposed approach and the baselines on Apache Spark, a popular distributed data processing platform. Extensive experiments on both synthetic and real datasets verify the effectiveness and efficiency of our approach.
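To illustrate why a robust error criterion matters when choosing a split, here is a minimal single-machine sketch. It is not the authors' algorithm: the abstract does not name the specific robust statistics used, so the sketch assumes least absolute deviation around the median as one common robust choice, and compares it with the usual squared error around the mean. The function names and the toy data are hypothetical, and the distributed data summarization step is not modeled.

```python
from statistics import mean, median


def squared_error(ys):
    """Standard criterion: sum of squared deviations from the mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def absolute_deviation(ys):
    """Assumed robust criterion: sum of absolute deviations from the median."""
    m = median(ys)
    return sum(abs(y - m) for y in ys)


def best_split(xs, ys, error):
    """Return (threshold, cost) of the single-feature split minimizing `error`."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        # Only split between distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = error(left) + error(right)
        if cost < best_cost:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_cost = cost
    return best_threshold, best_cost


# Toy data: the target steps from 1 to 10 between x = 4 and x = 5,
# with one outlier (100) at x = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 10, 100, 10, 10]

robust_threshold, _ = best_split(xs, ys, absolute_deviation)
standard_threshold, _ = best_split(xs, ys, squared_error)
print(robust_threshold, standard_threshold)  # 4.5 5.5
```

On this toy data the absolute-deviation criterion recovers the true step at 4.5, while the squared-error criterion is pulled toward the outlier and splits at 5.5.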

Similar resources

Predicting The Type of Malaria Using Classification and Regression Decision Trees

Maryam Ashoori (1, *), Fatemeh Hamzavi (2). (1) School of Technical and Engineering, Higher Educational Complex of Saravan, Saravan, Iran; (2) School of Agriculture, Higher Educational Complex of Saravan, Saravan, Iran. Abstract. Background: Malaria is an infectious disease infecting 200 - 300 million people annually. Environme...

Evaluating Hospital Case Cost Prediction Models Using Azure Machine Learning Studio

Ability for accurate hospital case cost modelling and prediction is critical for efficient health care financial management and budgetary planning. A variety of regression machine learning algorithms are known to be effective for health care cost predictions. The purpose of this experiment was to build an Azure Machine Learning Studio tool for rapid assessment of multiple types of regression mo...

A Robust Control Strategy for Distributed Generations in Islanded Microgrids

This paper presents a robust control scheme for distributed generations (DGs) in islanded mode operation of a microgrid (MG). In this strategy, assuming a dynamic slack bus with constant voltage magnitude and phase angle, nonlinear equations of the MG are solved in the slack-voltage-oriented synchronous reference frame, and the instantaneous active and reactive power reference for the slack bus...

Presenting a Model for Predicting Hemodialysis Filter Type Using Data Mining Techniques

Introduction: Inadequate dialysis for patients' kidneys as a mortality risk necessitates the presence of a pattern to assist staff in dialysate part to provide the proper services for dialysis patients and also the proper management of their treatment. Since the role of buffer type in the adequacy of dialysis is determinative, the present study is aimed at determining hemodialysis buffer type. ...

School of IT Technical Report DECISION TREES FOR IMBALANCED DATA SETS

We propose a new variant of decision tree for imbalanced classification. Decision trees use a greedy approach based on information gain to select the attribute to split. We express information gain in terms of confidence and show that, like confidence, information gain is biased towards the majority class. We overcome the bias of information gain by embedding a new measure, the ratio of confide...

Publication date: 2016